Technical Requirements

You may need to install Cairo on your operating system to run this notebook. To learn more see the Cairo documentation at https://www.cairographics.org/download/.

if(!require(Cairo)) install.packages("Cairo", repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
if(!require(car)) install.packages("car", repos = "http://cran.us.r-project.org")
library(readr)
library(ggplot2)
library(knitr)
library(tidyverse)
library(caret)
library(leaps)
library(car)
library(mice)
library(scales)
library(RColorBrewer)
library(plotly)

Introduction

Predicting Housing Prices

The purpose of this project is to predict the price of houses in California in 1990 based on a number of possible location-based predictors, including latitude, longitude, and information about houses within a particular block.

While this project focuses on prediction we are fully aware and want you the reader to also be aware that housing prices have increased dramatically since 1990, when the data was collected. This model should not be used to predict the actual future. This is a purely academic endeavor to explore statistical prediction.

The goal of the project is to create the model that can best predict home prices in California given reasonable test/train splits in the data.

California Housing Prices Dataset

We’re using the California Housing Prices dataset from the following Kaggle site: https://www.kaggle.com/camnugent/california-housing-prices. This data pertains to the houses found in a given California district and some summary stats about them based on the 1990 census data.

We loaded housing.csv into R.

housing_data = read_csv("housing.csv")
## Parsed with column specification:
## cols(
##   longitude = col_double(),
##   latitude = col_double(),
##   housing_median_age = col_double(),
##   total_rooms = col_double(),
##   total_bedrooms = col_double(),
##   population = col_double(),
##   households = col_double(),
##   median_income = col_double(),
##   median_house_value = col_double(),
##   ocean_proximity = col_character()
## )
housing_data$median_house_value[1:100]
##   [1] 452600 358500 352100 341300 342200 269700 299200 241400 226700 261100
##  [11] 281500 241800 213500 191300 159200 140000 152500 155500 158700 162900
##  [21] 147500 159800 113900  99700 132600 107500  93800 105500 108900 132000
##  [31] 122300 115200 110400 104900 109700  97200 104500 103900 191400 176000
##  [41] 155400 150000 118800 188800 184400 182300 142500 137500 187500 112500
##  [51] 171900  93800  97500 104200  87500  83100  87500  85300  80300  60000
##  [61]  75700  75000  86100  76100  73500  78400  84400  81300  85000 129200
##  [71]  82500  95200  75000  67500 137500 177500 102100 108300 112500 131300
##  [81] 162500 112500 112500 137500 118800  98200 118800 162500 137500 500001
##  [91] 162500 137500 162500 187500 179200 130000 183800 125000 170000 193100

The dataset contains 20640 observations and 10 attributes (9 predictors and 1 response). Below is a list of the variables with descriptions taken from the original Kaggle site given above.

  • longitude: A measure of how far west a house is; a higher value is farther west
  • latitude: A measure of how far north a house is; a higher value is farther north
  • housing_median_age: Median age of a house within a block; a lower number is a newer building
  • total_rooms: Total number of rooms within a block
  • total_bedrooms: Total number of bedrooms within a block
  • population: Total number of people residing within a block
  • households: Total number of households, a group of people residing within a home unit, for a block
  • median_income: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
  • ocean_proximity: Location of the house w.r.t ocean/sea
  • median_house_value: Median house value for households within a block (measured in US Dollars)

This dataset meets all of the stated criteria for the project including:

  • A minimum 200 observations
  • A numeric response variable: median_house_value
  • At least one categorical predictor: ocean_proximity
  • At least two numeric predictors: the remaining attributes

Let’s look at a summary of each variable:

summary(housing_data)
##    longitude         latitude     housing_median_age  total_rooms   
##  Min.   :-124.3   Min.   :32.54   Min.   : 1.00      Min.   :    2  
##  1st Qu.:-121.8   1st Qu.:33.93   1st Qu.:18.00      1st Qu.: 1448  
##  Median :-118.5   Median :34.26   Median :29.00      Median : 2127  
##  Mean   :-119.6   Mean   :35.63   Mean   :28.64      Mean   : 2636  
##  3rd Qu.:-118.0   3rd Qu.:37.71   3rd Qu.:37.00      3rd Qu.: 3148  
##  Max.   :-114.3   Max.   :41.95   Max.   :52.00      Max.   :39320  
##                                                                     
##  total_bedrooms     population      households     median_income    
##  Min.   :   1.0   Min.   :    3   Min.   :   1.0   Min.   : 0.4999  
##  1st Qu.: 296.0   1st Qu.:  787   1st Qu.: 280.0   1st Qu.: 2.5634  
##  Median : 435.0   Median : 1166   Median : 409.0   Median : 3.5348  
##  Mean   : 537.9   Mean   : 1425   Mean   : 499.5   Mean   : 3.8707  
##  3rd Qu.: 647.0   3rd Qu.: 1725   3rd Qu.: 605.0   3rd Qu.: 4.7432  
##  Max.   :6445.0   Max.   :35682   Max.   :6082.0   Max.   :15.0001  
##  NA's   :207                                                        
##  median_house_value ocean_proximity   
##  Min.   : 14999     Length:20640      
##  1st Qu.:119600     Class :character  
##  Median :179700     Mode  :character  
##  Mean   :206856                       
##  3rd Qu.:264725                       
##  Max.   :500001                       
## 

Note that the total_bedrooms variable has 207 NA values. We will address this issue in the Data Cleaning section in Methods.

Below is a visual representation of all data points in the dataset with longitude on the x-axis, latitude on the y-axis, and median_house_value shown in a color codes.

plot_map = ggplot(housing_data, 
                  aes(x = longitude, y = latitude, color = median_house_value, 
                      hma = housing_median_age, tr = total_rooms, tb = total_bedrooms,
                      hh = households, mi = median_income)) +
              geom_point(aes(size = population), alpha = 0.4) +
              xlab("Longitude") +
              ylab("Latitude") +
              ggtitle("Data Map - Longtitude vs Latitude and Associated Variables") +
              theme(plot.title = element_text(hjust = 0.5)) +
              scale_color_distiller(palette = "Paired", labels = comma) +
              labs(color = "Median House Value (in $USD)", size = "Population")
plot_map_tt = ggplotly(plot_map)
plot_map_tt

We see that on average the houses nearest to the ocean tend to have higher median house values, whereas those inland have the lower median values. This difference is quite substantial and tells us that the variable ocean_proximity will likely play a large role in predicting median house value.

It’s worth noting that some ocean-adjacent locations, like the more isolated towns along California’s northern most coast, have lower median house prices than other coastal housing, and some inland locations, like the towns near destinations such as Yosemite National Park and Lake Tahoe, do have higher median house prices than other inland housing. Also interesting to note that the houses closest to large metropolitan areas such as San Francisco and Los Angeles have among the highest median house values, whereas the largely agricultural Central Valley has among the lowest. This lets us know that we cannot rely on ocean_proximity alone to tell the whole story.

Methods

Data Cleaning

Categorical Variables

Initial exploration of the data showed us that there were a few steps we needed to take to make the data more useable. Firstly, we changed the categorical variable ocean_proximity from text-based to a factor variable.

housing_data$ocean_proximity = as.factor(housing_data$ocean_proximity)
levels(housing_data$ocean_proximity)
## [1] "<1H OCEAN"  "INLAND"     "ISLAND"     "NEAR BAY"   "NEAR OCEAN"

Taking a deeper dive into ocean_proximity, we see that the level ISLAND has a very low count compared to the other levels.

ggplot(housing_data, aes(x = factor(ocean_proximity))) +
  geom_bar(stat = "count", color = "black", fill = "blue")

We looked at the total count of each level to get a better idea of the skew in ocean_proximity.

summary(housing_data$ocean_proximity)
##  <1H OCEAN     INLAND     ISLAND   NEAR BAY NEAR OCEAN 
##       9136       6551          5       2290       2658

In fact, ISLAND only has five rows, where every other level has over 2,000 rows. Due to possible issues with fitting models, we decided to remove this level from ocean_proximity. We accepted the risk that our model will less accurately predict the value of houses on islands.

housing_data = housing_data[housing_data$ocean_proximity != "ISLAND", ]

Missing Data

The other thing to consider is missing data.

sum(is.na(housing_data))
## [1] 207
total_bedrooms = housing_data$total_bedrooms
sum(is.na(total_bedrooms))
## [1] 207

There are \(207\) observations with missing data for total_bedrooms. One thing we can do to solve this issue of NA values in total_bedrooms is to impute data points. We first want to take a look at the distribution of this variable to see which imputation method will work best.

bedroom_mean = mean(housing_data$total_bedrooms, na.rm=TRUE)
bedroom_median = median(housing_data$total_bedrooms, na.rm=TRUE)
ggplot(housing_data, aes(x = total_bedrooms)) +
  geom_histogram(bins = 40, color = "black", fill = "blue") +
  geom_vline(aes(xintercept = bedroom_mean, color = "Mean"), lwd = 1.5) +
  geom_vline(aes(xintercept = bedroom_median, color = "Median"), lwd = 1.5) +
  xlab("Total Bedrooms") +
  ylab("Frequency") +
  ggtitle("Histogram of Total Bedrooms (noncontinuous variable)") +
  scale_color_manual(name = "Summary Stats", labels = c("Mean", "Median"), values = c("red", "green"))
## Warning: Removed 207 rows containing non-finite values (stat_bin).

Looking at the distribution of the data in the histogram above, we decided to use the median of the total_bedrooms variable for our imputation method.

housing_data$total_bedrooms[is.na(housing_data$total_bedrooms)] = bedroom_median

Post-Cleaning

Looking at the structure of the dataset after cleaning the data, we see that besides the one factor variable ocean_proximity, we have nine numeric variables, three of which are continuous (longitude, latitude, and median_income) and six of which are discrete (housing_median_age, total_rooms, total_bedrooms, population, households, and median_house_value).

str(housing_data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    20635 obs. of  10 variables:
##  $ longitude         : num  -122 -122 -122 -122 -122 ...
##  $ latitude          : num  37.9 37.9 37.9 37.9 37.9 ...
##  $ housing_median_age: num  41 21 52 52 52 52 52 52 42 52 ...
##  $ total_rooms       : num  880 7099 1467 1274 1627 ...
##  $ total_bedrooms    : num  129 1106 190 235 280 ...
##  $ population        : num  322 2401 496 558 565 ...
##  $ households        : num  126 1138 177 219 259 ...
##  $ median_income     : num  8.33 8.3 7.26 5.64 3.85 ...
##  $ median_house_value: num  452600 358500 352100 341300 342200 ...
##  $ ocean_proximity   : Factor w/ 5 levels "<1H OCEAN","INLAND",..: 4 4 4 4 4 4 4 4 4 4 ...

For a better sense of the distribution of the nine numeric variables, we looked at histograms for each of them.

par(mfrow = c(3, 3))
hist(housing_data$longitude, breaks = 20, main = "longitude", border="darkorange", col="dodgerblue")
hist(housing_data$latitude, breaks = 20, main = "latitude", border="darkorange", col="dodgerblue")
hist(housing_data$housing_median_age, breaks = 20, main = "housing_median_age", border="darkorange", col="dodgerblue")
hist(housing_data$total_rooms, breaks = 20, main = "total_rooms", border="darkorange", col="dodgerblue")
hist(housing_data$total_bedrooms, breaks = 20, main = "total_bedrooms", border="darkorange", col="dodgerblue")
hist(housing_data$population, breaks = 20, main = "population", border="darkorange", col="dodgerblue")
hist(housing_data$households, breaks = 20, main = "households", border="darkorange", col="dodgerblue")
hist(housing_data$median_income, breaks = 20, main = "median_income", border="darkorange", col="dodgerblue")
hist(housing_data$median_house_value, breaks = 20, main = "median_house_value", border="darkorange", col="dodgerblue")

To better understand the relationship between the all the variables, we looked at the pairs.

pairs(housing_data, col = "dodgerblue")

It looks like there are some additional variables which may not be necessary due to collinearity. We looked further into the correlations between the numeric variables and showed them in the table below:

housing_data_nc = housing_data[, -10]#remove text variable for now

corrmatrix = cor(housing_data_nc)

kable(t(corrmatrix))
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
longitude 1.0000000 -0.9246761 -0.1083936 0.0446418 0.0691635 0.0998814 0.0554001 -0.0150904 -0.0462078
latitude -0.9246761 1.0000000 0.0114618 -0.0362306 -0.0665653 -0.1089785 -0.0711986 -0.0799766 -0.1438369
housing_median_age -0.1083936 0.0114618 1.0000000 -0.3612684 -0.3190613 -0.2961722 -0.3028629 -0.1189487 0.1052718
total_rooms 0.0446418 -0.0362306 -0.3612684 1.0000000 0.9270603 0.8571174 0.9184803 0.1979912 0.1343733
total_bedrooms 0.0691635 -0.0665653 -0.3190613 0.9270603 1.0000000 0.8735459 0.9743781 -0.0076601 0.0495614
population 0.0998814 -0.1089785 -0.2961722 0.8571174 0.8735459 1.0000000 0.9072131 0.0047371 -0.0244208
households 0.0554001 -0.0711986 -0.3028629 0.9184803 0.9743781 0.9072131 1.0000000 0.0129502 0.0660691
median_income -0.0150904 -0.0799766 -0.1189487 0.1979912 -0.0076601 0.0047371 0.0129502 1.0000000 0.6885627
median_house_value -0.0462078 -0.1438369 0.1052718 0.1343733 0.0495614 -0.0244208 0.0660691 0.6885627 1.0000000
#highcorr = findCorrelation(corrmatrix, cutoff = .60)#this will give you highly correlated variables

TODOs:

  • Does the section above fit where it is?
  • Josh will possibly create new variable of total bedrooms to households

Training and Test Data

First we want to split the data into training and testing data. We will use an 70/30 split of a randomized sample. We will use a set seed to get repeatability.

Note that we decided to partition using ocean_proximity since this is a predictor that we believe will play a large role in predicting housing prices.

set.seed(420)
housing_trn_idx = createDataPartition(housing_data$ocean_proximity, p = .70, list = FALSE)
## Warning in createDataPartition(housing_data$ocean_proximity, p = 0.7, list
## = FALSE): Some classes have no records ( ISLAND ) and these will be ignored
housing_trn_data = housing_data[housing_trn_idx, ]
housing_tst_data = housing_data[-housing_trn_idx, ]

Model Selection

Using all Model Subsets

As a starting point we want to get an idea which parameters are good ones to use in a potential model.

We can use leaps to tell us the “best” model, which is the one with the lowest Adjusted \(R^2\) for each number of parameters. Through testing we found this to select the full additive model when given the full additive as a starting point. It is more interesting when we use the full additive model with all two-way interactions.

Note from Jeanette - the plot / legend / data in the section below are showing up very weird for me. Maybe we can chat about this section this weekend? I have a few questions, too.

regsubsets.out = regsubsets(median_house_value ~ (longitude + latitude + housing_median_age + total_rooms + population + median_income + ocean_proximity) ^ 2,
               data = housing_trn_data,
               nbest = 1,     # 1 best model for each number of predictors
               nvmax = 10,    # NULL for no limit on number of variables
               force.in = NULL, force.out = NULL, method = "exhaustive", really.big=T)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 7 linear dependencies found
## Reordering variables and trying again:
# The output of this is long and not very pretty, there is better output below.
#regsubsets2.out

summary.out = summary(regsubsets.out)
dont_print_df = as.data.frame(summary.out$outmat)
## Adjusted R2
plot(regsubsets.out, scale = "adjr2", main = "Adjusted R^2")

#layout(matrix(1:2, ncol = 2))
## Adjusted R2
res.legend = subsets(regsubsets.out, statistic="adjr2", legend = FALSE, min.size = 5, main = "Adjusted R^2")
## Mallow Cp - A small value of Cp means that the model is relatively precise
#res.legend = subsets(regsubsets2.out, statistic="cp", legend = FALSE, min.size = 5, main = "Mallow Cp")
abline(a = 1, b = 1, lty = 2)

res.legend
##                                               Abbreviation
## longitude                                               ln
## latitude                                                lt
## housing_median_age                                    hs__
## total_rooms                                            tt_
## population                                              pp
## median_income                                          md_
## ocean_proximityINLAND                                 o_IN
## ocean_proximityNEAR BAY                                 oB
## ocean_proximityNEAR OCEAN                               oO
## longitude:latitude                                 lngtd:l
## longitude:housing_median_age                         ln:__
## longitude:total_rooms                             lngtd:t_
## longitude:population                               lngtd:p
## longitude:median_income                           lngtd:m_
## longitude:ocean_proximityINLAND                 ln:_INLAND
## longitude:ocean_proximityNEAR BAY              ln:_NEARBAY
## longitude:ocean_proximityNEAR OCEAN          ln:_NEAROCEAN
## latitude:housing_median_age                          lt:__
## latitude:total_rooms                               lttd:t_
## latitude:population                                    lt:
## latitude:median_income                             lttd:m_
## latitude:ocean_proximityINLAND                  lt:_INLAND
## latitude:ocean_proximityNEAR BAY               lt:_NEARBAY
## latitude:ocean_proximityNEAR OCEAN           lt:_NEAROCEAN
## housing_median_age:total_rooms               hsng_mdn_g:t_
## housing_median_age:population                        hs__:
## housing_median_age:median_income             hsng_mdn_g:m_
## housing_median_age:ocean_proximityINLAND           h__:_IN
## housing_median_age:ocean_proximityNEAR BAY              hB
## housing_median_age:ocean_proximityNEAR OCEAN            hO
## total_rooms:population                                tt_:
## total_rooms:median_income                            tt_:_
## total_rooms:ocean_proximityINLAND                   t_:_IN
## total_rooms:ocean_proximityNEAR BAY                     tB
## total_rooms:ocean_proximityNEAR OCEAN                   tO
## population:median_income                              pp:_
## population:ocean_proximityINLAND                     p:_IN
## population:ocean_proximityNEAR BAY                      pB
## population:ocean_proximityNEAR OCEAN                    pO
## median_income:ocean_proximityINLAND                 m_:_IN
## median_income:ocean_proximityNEAR BAY                   mB
## median_income:ocean_proximityNEAR OCEAN                 mO
## ocean_proximityISLAND                                 o_IS
## longitude:ocean_proximityISLAND                 ln:_ISLAND
## latitude:ocean_proximityISLAND                  lt:_ISLAND
## housing_median_age:ocean_proximityISLAND           h__:_IS
## total_rooms:ocean_proximityISLAND                   t_:_IS
## population:ocean_proximityISLAND                     p:_IS
## median_income:ocean_proximityISLAND                 m_:_IS
which.max(summary.out$adjr2)
## [1] 11
summary.out$which[which.max(summary.out$adjr2),]
##                                  (Intercept) 
##                                         TRUE 
##                                    longitude 
##                                         TRUE 
##                                     latitude 
##                                         TRUE 
##                           housing_median_age 
##                                        FALSE 
##                                  total_rooms 
##                                        FALSE 
##                                   population 
##                                        FALSE 
##                                median_income 
##                                        FALSE 
##                        ocean_proximityINLAND 
##                                        FALSE 
##                        ocean_proximityISLAND 
##                                        FALSE 
##                      ocean_proximityNEAR BAY 
##                                         TRUE 
##                    ocean_proximityNEAR OCEAN 
##                                        FALSE 
##                           longitude:latitude 
##                                         TRUE 
##                 longitude:housing_median_age 
##                                        FALSE 
##                        longitude:total_rooms 
##                                        FALSE 
##                         longitude:population 
##                                        FALSE 
##                      longitude:median_income 
##                                         TRUE 
##              longitude:ocean_proximityINLAND 
##                                         TRUE 
##              longitude:ocean_proximityISLAND 
##                                        FALSE 
##            longitude:ocean_proximityNEAR BAY 
##                                         TRUE 
##          longitude:ocean_proximityNEAR OCEAN 
##                                        FALSE 
##                  latitude:housing_median_age 
##                                        FALSE 
##                         latitude:total_rooms 
##                                        FALSE 
##                          latitude:population 
##                                        FALSE 
##                       latitude:median_income 
##                                        FALSE 
##               latitude:ocean_proximityINLAND 
##                                        FALSE 
##               latitude:ocean_proximityISLAND 
##                                        FALSE 
##             latitude:ocean_proximityNEAR BAY 
##                                         TRUE 
##           latitude:ocean_proximityNEAR OCEAN 
##                                        FALSE 
##               housing_median_age:total_rooms 
##                                         TRUE 
##                housing_median_age:population 
##                                        FALSE 
##             housing_median_age:median_income 
##                                        FALSE 
##     housing_median_age:ocean_proximityINLAND 
##                                        FALSE 
##     housing_median_age:ocean_proximityISLAND 
##                                        FALSE 
##   housing_median_age:ocean_proximityNEAR BAY 
##                                        FALSE 
## housing_median_age:ocean_proximityNEAR OCEAN 
##                                        FALSE 
##                       total_rooms:population 
##                                        FALSE 
##                    total_rooms:median_income 
##                                         TRUE 
##            total_rooms:ocean_proximityINLAND 
##                                        FALSE 
##            total_rooms:ocean_proximityISLAND 
##                                        FALSE 
##          total_rooms:ocean_proximityNEAR BAY 
##                                        FALSE 
##        total_rooms:ocean_proximityNEAR OCEAN 
##                                        FALSE 
##                     population:median_income 
##                                         TRUE 
##             population:ocean_proximityINLAND 
##                                        FALSE 
##             population:ocean_proximityISLAND 
##                                        FALSE 
##           population:ocean_proximityNEAR BAY 
##                                        FALSE 
##         population:ocean_proximityNEAR OCEAN 
##                                        FALSE 
##          median_income:ocean_proximityINLAND 
##                                        FALSE 
##          median_income:ocean_proximityISLAND 
##                                        FALSE 
##        median_income:ocean_proximityNEAR BAY 
##                                        FALSE 
##      median_income:ocean_proximityNEAR OCEAN 
##                                        FALSE
best_adjr2 = summary.out$adjr2[which.max(summary.out$adjr2)]

From this information we see that there are three models which have an Adjusted \(R^2\) close to the maximum: \(0.6645571\). We’ll use these three models as a starting point for model selection, as well as include a fourth model that we will introduce below.

best_leaps_model_1 = lm(median_house_value ~ longitude + latitude + I(ocean_proximity == "INLAND") + I(ocean_proximity == "NEAR BAY") + longitude:latitude + longitude:median_income + longitude:I(ocean_proximity == "NEAR BAY") + housing_median_age:total_rooms + housing_median_age:population + population:median_income, data = housing_trn_data)
#summary(best_leaps_model_1)
names(coef(best_leaps_model_1))[-1]
##  [1] "longitude"                                       
##  [2] "latitude"                                        
##  [3] "I(ocean_proximity == \"INLAND\")TRUE"            
##  [4] "I(ocean_proximity == \"NEAR BAY\")TRUE"          
##  [5] "longitude:latitude"                              
##  [6] "longitude:median_income"                         
##  [7] "longitude:I(ocean_proximity == \"NEAR BAY\")TRUE"
##  [8] "housing_median_age:total_rooms"                  
##  [9] "housing_median_age:population"                   
## [10] "median_income:population"

The first model to explore, best_leaps_model_1, uses the following predictors:

  • longitude
  • latitude
  • I(ocean_proximity == \"INLAND\")TRUE
  • I(ocean_proximity == \"NEAR BAY\")TRUE
  • longitude:latitude
  • longitude:median_income
  • longitude:I(ocean_proximity == \"NEAR BAY\")TRUE
  • housing_median_age:total_rooms
  • housing_median_age:population
  • median_income:population

This is an interesting set of predictors and they make a lot of sense. As we saw in the graph in the Dataset section titled Data Map - Longtitude vs Latitude and Associated Variables, it looks like housing prices are influenced by large metropolitan areas such as San Francisco and Los Angeles, where longitude, latitude and ocean_proximity would all play a large role. Also, median_income may also be a proxy for the larger cities versus the more rural areas.

The second model is explore, best_leaps_model_2, has an Adjusted \(R^2\) of \(0.6645571\), which ties the Adjusted \(R^2\) of best_leaps_model_1. To extract this model we are using the graph of the Adjusted \(R^2\) to get the parameters from the second row from the top. There are similarities, but a couple differences here.

best_leaps_model_2 = lm(median_house_value ~ housing_median_age + I(ocean_proximity == "INLAND") + I(ocean_proximity == "NEAR BAY") + longitude:housing_median_age + longitude:median_income + longitude:I(ocean_proximity == "NEAR BAY") + housing_median_age:total_rooms + housing_median_age:population, data = housing_trn_data)
summary(best_leaps_model_2)
## 
## Call:
## lm(formula = median_house_value ~ housing_median_age + I(ocean_proximity == 
##     "INLAND") + I(ocean_proximity == "NEAR BAY") + longitude:housing_median_age + 
##     longitude:median_income + longitude:I(ocean_proximity == 
##     "NEAR BAY") + housing_median_age:total_rooms + housing_median_age:population, 
##     data = housing_trn_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -263456  -43938  -11418   29327  474940 
## 
## Coefficients:
##                                                  Estimate Std. Error
## (Intercept)                                     7.501e+04  2.434e+03
## housing_median_age                              5.528e+02  1.374e+03
## I(ocean_proximity == "INLAND")TRUE             -8.331e+04  1.411e+03
## I(ocean_proximity == "NEAR BAY")TRUE           -2.913e+07  1.475e+06
## housing_median_age:longitude                   -9.579e-02  1.147e+01
## longitude:median_income                        -2.811e+02  2.925e+00
## I(ocean_proximity == "NEAR BAY")TRUE:longitude -2.382e+05  1.207e+04
## housing_median_age:total_rooms                  7.705e-01  2.473e-02
## housing_median_age:population                  -1.047e+00  4.187e-02
##                                                t value Pr(>|t|)    
## (Intercept)                                     30.812   <2e-16 ***
## housing_median_age                               0.402    0.688    
## I(ocean_proximity == "INLAND")TRUE             -59.049   <2e-16 ***
## I(ocean_proximity == "NEAR BAY")TRUE           -19.748   <2e-16 ***
## housing_median_age:longitude                    -0.008    0.993    
## longitude:median_income                        -96.104   <2e-16 ***
## I(ocean_proximity == "NEAR BAY")TRUE:longitude -19.744   <2e-16 ***
## housing_median_age:total_rooms                  31.152   <2e-16 ***
## housing_median_age:population                  -25.017   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69720 on 14437 degrees of freedom
## Multiple R-squared:  0.638,  Adjusted R-squared:  0.6378 
## F-statistic:  3180 on 8 and 14437 DF,  p-value: < 2.2e-16
names(coef(best_leaps_model_2))[-1]
## [1] "housing_median_age"                              
## [2] "I(ocean_proximity == \"INLAND\")TRUE"            
## [3] "I(ocean_proximity == \"NEAR BAY\")TRUE"          
## [4] "housing_median_age:longitude"                    
## [5] "longitude:median_income"                         
## [6] "I(ocean_proximity == \"NEAR BAY\")TRUE:longitude"
## [7] "housing_median_age:total_rooms"                  
## [8] "housing_median_age:population"

The best_leaps_model_2 uses the following predictors:

  • housing_median_age
  • I(ocean_proximity == \"INLAND\")TRUE
  • I(ocean_proximity == \"NEAR BAY\")TRUE
  • housing_median_age:longitude
  • longitude:median_income
  • I(ocean_proximity == \"NEAR BAY\")TRUE:longitude
  • housing_median_age:total_rooms
  • housing_median_age:population

We see in this model, housing_median_age seems much more influential than in best_leaps_model_1. The ocean_proximity, however, is still quite important here.

The third model we’ll explore, best_leaps_model_3, also scored an adjusted \(R^2\) of \(0.6645571\).

best_leaps_model_3 = lm(median_house_value ~ median_income + I(ocean_proximity == "INLAND") + I(ocean_proximity == "NEAR BAY") + longitude:median_income + latitude:median_income + longitude:I(ocean_proximity == "NEAR BAY") + housing_median_age:total_rooms + housing_median_age:population, data = housing_trn_data)
summary(best_leaps_model_3)
## 
## Call:
## lm(formula = median_house_value ~ median_income + I(ocean_proximity == 
##     "INLAND") + I(ocean_proximity == "NEAR BAY") + longitude:median_income + 
##     latitude:median_income + longitude:I(ocean_proximity == "NEAR BAY") + 
##     housing_median_age:total_rooms + housing_median_age:population, 
##     data = housing_trn_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -276300  -41906  -12186   28334  448358 
## 
## Coefficients:
##                                                  Estimate Std. Error
## (Intercept)                                     8.887e+04  1.925e+03
## median_income                                  -6.143e+05  2.516e+04
## I(ocean_proximity == "INLAND")TRUE             -5.354e+04  1.919e+03
## I(ocean_proximity == "NEAR BAY")TRUE           -2.880e+07  1.435e+06
## median_income:longitude                        -7.659e+03  2.973e+02
## median_income:latitude                         -7.607e+03  3.090e+02
## I(ocean_proximity == "NEAR BAY")TRUE:longitude -2.356e+05  1.174e+04
## housing_median_age:total_rooms                  7.987e-01  2.402e-02
## housing_median_age:population                  -1.046e+00  4.077e-02
##                                                t value Pr(>|t|)    
## (Intercept)                                      46.18   <2e-16 ***
## median_income                                   -24.42   <2e-16 ***
## I(ocean_proximity == "INLAND")TRUE              -27.90   <2e-16 ***
## I(ocean_proximity == "NEAR BAY")TRUE            -20.07   <2e-16 ***
## median_income:longitude                         -25.76   <2e-16 ***
## median_income:latitude                          -24.62   <2e-16 ***
## I(ocean_proximity == "NEAR BAY")TRUE:longitude  -20.07   <2e-16 ***
## housing_median_age:total_rooms                   33.25   <2e-16 ***
## housing_median_age:population                   -25.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68530 on 14437 degrees of freedom
## Multiple R-squared:  0.6502, Adjusted R-squared:   0.65 
## F-statistic:  3355 on 8 and 14437 DF,  p-value: < 2.2e-16
names(coef(best_leaps_model_3))[-1]
## [1] "median_income"                                   
## [2] "I(ocean_proximity == \"INLAND\")TRUE"            
## [3] "I(ocean_proximity == \"NEAR BAY\")TRUE"          
## [4] "median_income:longitude"                         
## [5] "median_income:latitude"                          
## [6] "I(ocean_proximity == \"NEAR BAY\")TRUE:longitude"
## [7] "housing_median_age:total_rooms"                  
## [8] "housing_median_age:population"

The best_leaps_model_3 uses the following predictors:

  • median_income
  • I(ocean_proximity == \"INLAND\")TRUE
  • I(ocean_proximity == \"NEAR BAY\")TRUE
  • median_income:longitude
  • median_income:latitude
  • I(ocean_proximity == \"NEAR BAY\")TRUE:longitude
  • housing_median_age:total_rooms
  • housing_median_age:population

This model is very similar to the best_leaps_model_2, except that median_income is more important here.

We will also include a fourth model, full_twoway_model, which includes all variables in the model as well as their two-way interactions.

full_twoway_model = twoway_mod_start = lm(median_house_value ~ (.)^2, data = housing_trn_data)

We can now compare the AIC of these four models.

extractAIC(best_leaps_model_1)
## [1]     11.0 321471.3
extractAIC(best_leaps_model_2)
## [1]      9.0 322217.7
extractAIC(best_leaps_model_3)
## [1]      9.0 321721.9
extractAIC(full_twoway_model)
## [1]     64.0 319327.3

We see that the model with the best (i.e., lowest) AIC is full_twoway_model, with a score of \(3.1932733\times 10^{5}\). However, although this model has the lowest AIC, it also has far more predictors (and therefore is more complex) than the other three models - \(64\) compared to \(11\) predictors for best_leaps_model_1, \(9\) predictors for best_leaps_model_2 and \(9\) predictors for best_leaps_model_3. This is just something to keep in mind as we move forward.

Using AIC and BIC

All 4 of these models are good candidates for forward and backward AIC and BIC.

TODOs: - None of the forward steps included a scope, though in the lectures / in the book they always included it. Might be worth stating why this is the case and make an argument for it?

# full_twoway_model
# AIC
back_twoway_mod_finish_aic = step(full_twoway_model, direction = "backward", trace = 0)
forward_twoway_mod_finish_aic = step(full_twoway_model, direction = "forward", trace = 0)
# BIC
n = length(resid(full_twoway_model))
back_twoway_mod_finish_bic = step(full_twoway_model, direction = "backward", k = log(n), trace = 0)
forward_twoway_mod_finish_bic = step(full_twoway_model, direction = "forward", k = log(n), trace = 0)

# best_leaps_model_1
# AIC
back_aic_mod_1 = step(best_leaps_model_1, direction = "backward", trace = 0)
forward_aic_mod_1 = step(best_leaps_model_1, direction = "forward", trace = 0)
# BIC
n = length(resid(best_leaps_model_1))
back_bic_mod_1 = step(best_leaps_model_1, direction = "backward", k = log(n), trace = 0)
forward_bic_mod_1 = step(best_leaps_model_1, direction = "forward", k = log(n), trace = 0)

# best_leaps_model_2
# AIC
back_aic_mod_2 = step(best_leaps_model_2, direction = "backward", trace = 0)
forward_aic_mod_2 = step(best_leaps_model_2, direction = "forward", trace = 0)
# BIC
n = length(resid(best_leaps_model_2))
back_bic_mod_2 = step(best_leaps_model_2, direction = "backward", k = log(n), trace = 0)
forward_bic_mod_2 = step(best_leaps_model_2, direction = "forward", k = log(n), trace = 0)

# best_leaps_model_3
# AIC
back_aic_mod_3 = step(best_leaps_model_3, direction = "backward", trace = 0)
forward_aic_mod_3 = step(best_leaps_model_3, direction = "forward", trace = 0)
# BIC
n = length(resid(best_leaps_model_3))
back_bic_mod_3 = step(best_leaps_model_3, direction = "backward", k = log(n), trace = 0)
forward_bic_mod_3 = step(best_leaps_model_2, direction = "forward", k = log(n), trace = 0)
aic_and_bic_results = data.frame(
  "AIC" =
    c("Backward" =
        c("Two-Way" = extractAIC(back_twoway_mod_finish_aic)[2],
          "Leap M1" = extractAIC(back_aic_mod_1)[2],
          "Leap M2" = extractAIC(back_aic_mod_2)[2],
          "Leap M3" = extractAIC(back_aic_mod_3)[2]),
      "Forward" =
        c("Two-Way" = extractAIC(forward_twoway_mod_finish_aic)[2],
          "Leap M1" = extractAIC(forward_aic_mod_1)[2],
          "Leap M2" = extractAIC(forward_aic_mod_2)[2],
          "Leap M3" = extractAIC(forward_aic_mod_3)[2])),
  "BIC" =
    c("Backward" =
        c("Two-Way" = extractAIC(back_twoway_mod_finish_bic)[2],
          "Leap M1" = extractAIC(back_bic_mod_1)[2],
          "Leap M2" = extractAIC(back_bic_mod_2)[2],
          "Leap M3" = extractAIC(back_bic_mod_3)[2]),
      "Forward" =
        c("Two-Way" = extractAIC(forward_twoway_mod_finish_bic)[2],
          "Leap M1" = extractAIC(forward_bic_mod_1)[2],
          "Leap M2" = extractAIC(forward_bic_mod_2)[2],
          "Leap M3" = extractAIC(forward_bic_mod_3)[2])))

kable(aic_and_bic_results)
AIC BIC
Backward.Two-Way 319323.9 319333.0
Backward.Leap M1 321471.3 321471.3
Backward.Leap M2 322215.7 322215.7
Backward.Leap M3 321721.9 321721.9
Forward.Two-Way 319327.3 319327.3
Forward.Leap M1 321471.3 321471.3
Forward.Leap M2 322217.7 322217.7
Forward.Leap M3 321721.9 322217.7

TODOs: - Jeanette to standardize variable names and how they are referred to (let’s not call them “leaps” since that is just the library that has the function for doing the exhaustive search)

The exhaustive models that we started with (best_leaps_model_1, best_leaps_model_2, and best_leaps_model_3) did not improve significantly. Neither did full_twoway_model which started with a full two-way interaction. However the back_twoway_mod_finish_aic model did have slight improvement over full_twoway_model, which was originally the best performing model.

TODOs: - Not choosing the ones coming out of the full twoway model - we can tie this maybe to the discussion of the first extractAIC and that this model is far more complex and therefore might not be ideal despite its lower AIC??

We now have the three models best_leaps_model_1, back_aic_mod_2 and best_leaps_model_3 to test going forward. Let’s check the assumptions on these three models.

TODOs: - update the diagnostic function, as it is currently just the one from the homework (with change to limit to 5000 to make it work for this dataset) <- I will work on this ~Josh

diagnostics = function(model, pcol = 'grey', lcol = 'dodgerblue', alpha = 0.05, plotit = TRUE, testit = TRUE) {
  if(plotit == TRUE){
    # Two plots, side-by-side
    par(mfrow = c(1, 2))
    # A fitted versus residuals plot
    plot(fitted(model), resid(model), col = pcol, pch = 20, cex = 2, xlab = "Fitted", ylab = "Residuals", main = "Fitted vs. Residuals")
    abline(h = 0, col = lcol, lwd = 2)
    
    # A Normal Q-Q plot of the residuals
    qqnorm(resid(model), col = pcol, pch = 20, cex = 2, main = "Normal Q-Q Plot")
    qqline(resid(model), col = lcol, lwd = 2)
  }
  if(testit == TRUE){
    # NOTE: the shapiro test is limited to the first 5000 records
    # See: https://stackoverflow.com/questions/28217306/error-in-shapiro-test-sample-size-must-be-between
    p_val = shapiro.test(resid(model)[0:5000])$p.value
    decision = ifelse(p_val < alpha, "Reject", "Fail to Reject")
    list(p_val = p_val, decision = decision)
  }
}
diagnostics(best_leaps_model_1)

## $p_val
## [1] 6.065702e-46
## 
## $decision
## [1] "Reject"
diagnostics(back_aic_mod_2)

## $p_val
## [1] 4.701471e-45
## 
## $decision
## [1] "Reject"
diagnostics(best_leaps_model_3)

## $p_val
## [1] 6.605337e-48
## 
## $decision
## [1] "Reject"

TODO: - Discuss the output of the diagnostics plot. - Thinking more about this, don’t these kind of diagnostics happen before model selection, to see if there’s anything else we need to do to transform the predictors or the response so that the data makes more sense? So I’m not sure this section belongs here.

Detect Overfitting

# From the text: http://daviddalpiaz.github.io/appliedstats/variable-selection-and-model-building.html
calc_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
calc_loocv_rmse(best_leaps_model_1)
## [1] 67979.89
calc_loocv_rmse(best_leaps_model_3)
## [1] 68573.52
calc_loocv_rmse(back_aic_mod_2)
## [1] 69751.4

The best_leaps_model_1 model has the lowest loocv_rmse.

TODOs: - Explain what this means and why it matters.

Results

Make predictions

# the actual median house values from the test set
actual = housing_tst_data$median_house_value

# predict using the best_leaps_model_1
predicted = predict(best_leaps_model_1, newdata = housing_tst_data)
(mod1_error = 100 * mean((abs(actual - predicted)) / actual))
## [1] 29.26661
# predict using the best_leaps_model_3
predicted = predict(best_leaps_model_3, newdata = housing_tst_data)
(mod2_error = 100 * mean((abs(actual - predicted)) / actual))
## [1] 28.60524
# predict using the back_aic_mod_2
predicted = predict(back_aic_mod_2, newdata = housing_tst_data)
(mod3_error = 100 * mean((abs(actual - predicted)) / actual))
## [1] 29.22784
errors = sort(c(mod1_error, mod2_error, mod3_error))

These three models have very similar percentages of error, ranging from \(28.6052359\)% to \(29.2666087\)%.

Discussion

TODO: these are likely better plots than what I stubbed in above, so they likely should be integrated in the right place.

#lets use Weighted Least Squares on this model to cover potential heteroskedasticity.

back_bic_mod_fitted = fitted(back_bic_mod)
back_bic_mod_resid = resid(back_bic_mod)

temp_wls_mod = lm(log(back_bic_mod_resid^2) ~ back_bic_mod_fitted + back_bic_mod_fitted^2)
ghat = fitted(temp_wls_mod)
hhat = exp(ghat)

WLS_back_bic_mod = update(back_bic_mod, weights = 1 / hhat)# to do: #figure out how to get call

plot(
      back_bic_mod$fitted.values,
      back_bic_mod$residuals
)
plot(
      WLS_back_bic_mod$fitted.values,
      WLS_back_bic_mod$residuals
)